{"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"name":"NLU_training_multi_class_text_classifier_demo_hotel_reviews.ipynb","provenance":[],"collapsed_sections":["zkufh760uvF3"]},"kernelspec":{"name":"python3","display_name":"Python 3"}},"cells":[{"cell_type":"markdown","metadata":{"id":"zkufh760uvF3"},"source":["![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)\n","\n","[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/nlu/blob/master/examples/colab/Training/multi_class_text_classification/NLU_training_multi_class_text_classifier_demo_hotel_reviews.ipynb)\n","\n","\n","\n","# Training a Deep Learning Classifier with NLU \n","## ClassifierDL (Multi-class Text Classification)\n","## 3 class Tripadvisor Hotel review classifier training\n","With the [ClassifierDL model](https://nlp.johnsnowlabs.com/docs/en/annotators#classifierdl-multi-class-text-classification) from Spark NLP you can achieve State Of the Art results on any multi class text classification problem \n","\n","This notebook showcases the following features : \n","\n","- How to train the deep learning classifier\n","- How to store a pipeline to disk\n","- How to load the pipeline from disk (Enables NLU offline mode)\n","\n","You can achieve these results or even better on this dataset with training data:\n","\n","<br>\n","\n","![image.png]()\n","\n","You can achieve these results or even better on this dataset with test data:\n","\n","<br>\n","\n","\n","![image.png]()\n"]},{"cell_type":"markdown","metadata":{"id":"dur2drhW5Rvi"},"source":["# 1. Install Java 8 and NLU"]},{"cell_type":"code","metadata":{"id":"hFGnBCHavltY","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1620191227160,"user_tz":-300,"elapsed":116826,"user":{"displayName":"Gammer Otaku","photoUrl":"","userId":"18042713576744284398"}},"outputId":"f2dc1e9c-3872-46ef-ccc3-9c6682cdf498"},"source":["!wget https://setup.johnsnowlabs.com/nlu/colab.sh -O - | bash\n","  \n","\n","import nlu"],"execution_count":null,"outputs":[{"output_type":"stream","text":["--2021-05-05 05:05:11--  https://raw.githubusercontent.com/JohnSnowLabs/nlu/master/scripts/colab_setup.sh\n","Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...\n","Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.\n","HTTP request sent, awaiting response... 200 OK\n","Length: 1671 (1.6K) [text/plain]\n","Saving to: ‘STDOUT’\n","\n","\r-                     0%[                    ]       0  --.-KB/s               Installing  NLU 3.0.0 with  PySpark 3.0.2 and Spark NLP 3.0.1 for Google Colab ...\n","\r-                   100%[===================>]   1.63K  --.-KB/s    in 0.001s  \n","\n","2021-05-05 05:05:11 (1.65 MB/s) - written to stdout [1671/1671]\n","\n","\u001b[K     |████████████████████████████████| 204.8MB 72kB/s \n","\u001b[K     |████████████████████████████████| 153kB 54.3MB/s \n","\u001b[K     |████████████████████████████████| 204kB 22.3MB/s \n","\u001b[K     |████████████████████████████████| 204kB 53.7MB/s \n","\u001b[?25h  Building wheel for pyspark (setup.py) ... \u001b[?25l\u001b[?25hdone\n"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"f4KkTfnR5Ugg"},"source":["# 2. 
{"cell_type":"markdown","metadata":{"id":"f4KkTfnR5Ugg"},"source":["# 2. Download the hotel reviews dataset\n","https://www.kaggle.com/andrewmvd/trip-advisor-hotel-reviews\n","\n","Hotels play a crucial role in travel, and with increased access to information, new ways of selecting the best ones have emerged.\n","This dataset consists of 20k reviews crawled from Tripadvisor; with it you can explore what makes a great hotel, and maybe even use this model on your own travels!\n"]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"OrVb5ZMvvrQD","executionInfo":{"status":"ok","timestamp":1620191228830,"user_tz":-300,"elapsed":118485,"user":{"displayName":"Gammer Otaku","photoUrl":"","userId":"18042713576744284398"}},"outputId":"c370ca94-56e6-4b3a-e205-d38162a54d13"},"source":["! wget http://ckl-it.de/wp-content/uploads/2021/01/tripadvisor_hotel_reviews.csv\n"],"execution_count":null,"outputs":[{"output_type":"stream","text":["--2021-05-05 05:07:06--  http://ckl-it.de/wp-content/uploads/2021/01/tripadvisor_hotel_reviews.csv\n","Resolving ckl-it.de (ckl-it.de)... 217.160.0.108, 2001:8d8:100f:f000::209\n","Connecting to ckl-it.de (ckl-it.de)|217.160.0.108|:80... connected.\n","HTTP request sent, awaiting response... 200 OK\n","Length: 5160790 (4.9M) [text/csv]\n","Saving to: ‘tripadvisor_hotel_reviews.csv’\n","\n","tripadvisor_hotel_r 100%[===================>]   4.92M  4.01MB/s    in 1.2s    \n","\n","2021-05-05 05:07:08 (4.01 MB/s) - ‘tripadvisor_hotel_reviews.csv’ saved [5160790/5160790]\n","\n"],"name":"stdout"}]},
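{"cell_type":"markdown","metadata":{},"source":["Before loading everything, it can help to peek at the raw file. A minimal sketch, assuming the CSV landed in /content as the wget log above shows:"]},{"cell_type":"code","metadata":{},"source":["# Peek at the first few rows of the downloaded CSV\n","import pandas as pd\n","pd.read_csv('/content/tripadvisor_hotel_reviews.csv', nrows=3)"],"execution_count":null,"outputs":[]},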
<td>...</td>\n","    </tr>\n","    <tr>\n","      <th>6318</th>\n","      <td>average</td>\n","      <td>average read travellers wrote hotel prior leav...</td>\n","    </tr>\n","    <tr>\n","      <th>3855</th>\n","      <td>great</td>\n","      <td>plush comfortable stayed nights mark hopkins f...</td>\n","    </tr>\n","    <tr>\n","      <th>1061</th>\n","      <td>average</td>\n","      <td>great potential just want special thanks anima...</td>\n","    </tr>\n","    <tr>\n","      <th>3060</th>\n","      <td>average</td>\n","      <td>centrally located hotel enjoyable stay stayed ...</td>\n","    </tr>\n","    <tr>\n","      <th>5239</th>\n","      <td>average</td>\n","      <td>distinctly average stayed short trip hong kong...</td>\n","    </tr>\n","  </tbody>\n","</table>\n","<p>5241 rows × 2 columns</p>\n","</div>"],"text/plain":["            y                                               text\n","577   average  decent hotel decent price stayed 5 nights delu...\n","5746    great  good said previous posts small tasteful renova...\n","675     great  gold floor best stayed gold floor club floor f...\n","4415     poor  truly awful, admit slightly sceptical arriving...\n","4099    great  union square jewel loved hotel ammenities loca...\n","...       ...                                                ...\n","6318  average  average read travellers wrote hotel prior leav...\n","3855    great  plush comfortable stayed nights mark hopkins f...\n","1061  average  great potential just want special thanks anima...\n","3060  average  centrally located hotel enjoyable stay stayed ...\n","5239  average  distinctly average stayed short trip hong kong...\n","\n","[5241 rows x 2 columns]"]},"metadata":{"tags":[]},"execution_count":3}]},{"cell_type":"markdown","metadata":{"id":"0296Om2C5anY"},"source":["# 3. Train Deep Learning Classifier using nlu.load('train.classifier')\n","\n","You dataset label column should be named 'y' and the feature column with text data should be named 'text'"]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/","height":1000},"id":"3ZIPkRkWftBG","executionInfo":{"status":"ok","timestamp":1620191368058,"user_tz":-300,"elapsed":257686,"user":{"displayName":"Gammer Otaku","photoUrl":"","userId":"18042713576744284398"}},"outputId":"4e41c7cc-b230-4cc6-a819-2b540d620b2b"},"source":["# load a trainable pipeline by specifying the train. 
{"cell_type":"markdown","metadata":{"id":"0296Om2C5anY"},"source":["# 3. Train Deep Learning Classifier using nlu.load('train.classifier')\n","\n","Your dataset's label column should be named 'y' and the feature column with the text data should be named 'text'"]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/","height":1000},"id":"3ZIPkRkWftBG","executionInfo":{"status":"ok","timestamp":1620191368058,"user_tz":-300,"elapsed":257686,"user":{"displayName":"Gammer Otaku","photoUrl":"","userId":"18042713576744284398"}},"outputId":"4e41c7cc-b230-4cc6-a819-2b540d620b2b"},"source":["# load a trainable pipeline by specifying the train. prefix and fit it on a dataset with label and text columns\n","trainable_pipe = nlu.load('train.classifier')\n","fitted_pipe = trainable_pipe.fit(train_df.iloc[:50])\n","\n","# predict with the fitted pipeline on the dataset and get predictions\n","preds = fitted_pipe.predict(train_df.iloc[:50], output_level='document')\n","preds"],"execution_count":null,"outputs":[{"output_type":"stream","text":["tfhub_use download started this may take some time.\n","Approximate size to download 923.7 MB\n","[OK!]\n","sentence_detector_dl download started this may take some time.\n","Approximate size to download 354.6 KB\n","[OK!]\n"],"name":"stdout"},{"output_type":"execute_result","data":{"text/html":["<div>\n","<style scoped>\n","    .dataframe tbody tr th:only-of-type {\n","        vertical-align: middle;\n","    }\n","\n","    .dataframe tbody tr th {\n","        vertical-align: top;\n","    }\n","\n","    .dataframe thead th {\n","        text-align: right;\n","    }\n","</style>\n","<table border=\"1\" class=\"dataframe\">\n","  <thead>\n","    <tr style=\"text-align: right;\">\n","      <th></th>\n","      <th>trained_classifier_confidence_confidence</th>\n","      <th>text</th>\n","      <th>y</th>\n","      <th>sentence</th>\n","      <th>sentence_embedding_use</th>\n","      <th>document</th>\n","      <th>trained_classifier</th>\n","      <th>origin_index</th>\n","    </tr>\n","  </thead>\n","  <tbody>\n","    <tr>\n","      <th>0</th>\n","      <td>0.545426</td>\n","      <td>decent hotel decent price stayed 5 nights delu...</td>\n","      <td>average</td>\n","      <td>[decent hotel decent price stayed 5 nights del...</td>\n","      <td>[0.053218383342027664, 0.04507320374250412, 0....</td>\n","      <td>decent hotel decent price stayed 5 nights delu...</td>\n","      <td>poor</td>\n","      <td>577</td>\n","    </tr>\n","    <tr>\n","      <th>1</th>\n","      <td>0.697417</td>\n","      <td>good said previous posts small tasteful renova...</td>\n","      <td>great</td>\n","      <td>[good said previous posts small tasteful renov...</td>\n","      <td>[0.039876967668533325, 0.06624795496463776, -0...</td>\n","      <td>good said previous posts small tasteful renova...</td>\n","      <td>great</td>\n","      <td>5746</td>\n","    </tr>\n","    <tr>\n","      <th>2</th>\n","      <td>0.671591</td>\n","      <td>gold floor best stayed gold floor club floor f...</td>\n","      <td>great</td>\n","      <td>[gold floor best stayed gold floor club floor ...</td>\n","      <td>[0.0038577697705477476, 0.05996308475732803, -...</td>\n","      <td>gold floor best stayed gold floor club floor f...</td>\n","      <td>great</td>\n","      <td>675</td>\n","    </tr>\n","    <tr>\n","      <th>3</th>\n","      <td>0.969960</td>\n","      <td>truly awful, admit slightly sceptical arriving...</td>\n","      <td>poor</td>\n","      <td>[truly awful, admit slightly sceptical arrivin...</td>\n","      <td>[0.06336931884288788, 0.0006446511833928525, -...</td>\n","      <td>truly awful, admit slightly sceptical arriving...</td>\n","      <td>poor</td>\n","      <td>4415</td>\n","    </tr>\n","    <tr>\n","      <th>4</th>\n","      <td>0.668067</td>\n","      <td>union square jewel loved hotel ammenities loca...</td>\n","      <td>great</td>\n","      <td>[union square jewel loved hotel ammenities loc...</td>\n","      <td>[0.025744924321770668, 0.06057509407401085, 0....</td>\n","      <td>union square jewel loved hotel ammenities loca...</td>\n","      <td>great</td>\n","    
  <td>4099</td>\n","    </tr>\n","    <tr>\n","      <th>5</th>\n","      <td>0.516283</td>\n","      <td>affinia 50 great location room so-so affinia 5...</td>\n","      <td>average</td>\n","      <td>[affinia 50 great location room so-so affinia ...</td>\n","      <td>[0.06250237673521042, 0.009299570694565773, 0....</td>\n","      <td>affinia 50 great location room so-so affinia 5...</td>\n","      <td>poor</td>\n","      <td>2621</td>\n","    </tr>\n","    <tr>\n","      <th>6</th>\n","      <td>0.484881</td>\n","      <td>impressed stay travel area 3 x year business, ...</td>\n","      <td>poor</td>\n","      <td>[impressed stay travel area 3 x year business,...</td>\n","      <td>[0.05403365194797516, 0.05502327159047127, 0.0...</td>\n","      <td>impressed stay travel area 3 x year business, ...</td>\n","      <td>great</td>\n","      <td>2165</td>\n","    </tr>\n","    <tr>\n","      <th>7</th>\n","      <td>0.648219</td>\n","      <td>nothing memorable no complaints close train st...</td>\n","      <td>average</td>\n","      <td>[nothing memorable no complaints close train s...</td>\n","      <td>[0.03432552516460419, 0.05315956100821495, -0....</td>\n","      <td>nothing memorable no complaints close train st...</td>\n","      <td>great</td>\n","      <td>2522</td>\n","    </tr>\n","    <tr>\n","      <th>8</th>\n","      <td>0.561468</td>\n","      <td>not recomended chose hotel based recomendation...</td>\n","      <td>poor</td>\n","      <td>[not recomended chose hotel based recomendatio...</td>\n","      <td>[0.06976918131113052, 0.021174855530261993, -0...</td>\n","      <td>not recomended chose hotel based recomendation...</td>\n","      <td>poor</td>\n","      <td>5187</td>\n","    </tr>\n","    <tr>\n","      <th>9</th>\n","      <td>0.822802</td>\n","      <td>promising 2-day stay location couple weeks ear...</td>\n","      <td>average</td>\n","      <td>[promising 2-day stay location couple weeks ea...</td>\n","      <td>[0.05621741712093353, 0.025512538850307465, -0...</td>\n","      <td>promising 2-day stay location couple weeks ear...</td>\n","      <td>poor</td>\n","      <td>488</td>\n","    </tr>\n","    <tr>\n","      <th>10</th>\n","      <td>0.434855</td>\n","      <td>better thought, booked 2x queen bed 1x sofa be...</td>\n","      <td>average</td>\n","      <td>[better thought, booked 2x queen bed 1x sofa b...</td>\n","      <td>[0.06342098116874695, 0.02677888236939907, -0....</td>\n","      <td>better thought, booked 2x queen bed 1x sofa be...</td>\n","      <td>poor</td>\n","      <td>1944</td>\n","    </tr>\n","    <tr>\n","      <th>11</th>\n","      <td>0.639247</td>\n","      <td>gracious elegant husband recently spent nights...</td>\n","      <td>great</td>\n","      <td>[gracious elegant husband recently spent night...</td>\n","      <td>[0.05973555147647858, 0.056692928075790405, 0....</td>\n","      <td>gracious elegant husband recently spent nights...</td>\n","      <td>great</td>\n","      <td>4809</td>\n","    </tr>\n","    <tr>\n","      <th>12</th>\n","      <td>0.912950</td>\n","      <td>absoultely horrible family booked trip oct. 20...</td>\n","      <td>poor</td>\n","      <td>[absoultely horrible family booked trip oct., ...</td>\n","      <td>[-0.02683592215180397, 0.05384364351630211, 0....</td>\n","      <td>absoultely horrible family booked trip oct. 
20...</td>\n","      <td>poor</td>\n","      <td>1825</td>\n","    </tr>\n","    <tr>\n","      <th>13</th>\n","      <td>0.607228</td>\n","      <td>helpful just returned night stay museum square...</td>\n","      <td>average</td>\n","      <td>[helpful just returned night stay museum squar...</td>\n","      <td>[0.04092474281787872, 0.047728583216667175, 0....</td>\n","      <td>helpful just returned night stay museum square...</td>\n","      <td>great</td>\n","      <td>1663</td>\n","    </tr>\n","    <tr>\n","      <th>14</th>\n","      <td>0.822926</td>\n","      <td>need refurbishment stayed 7 nights, hotel grea...</td>\n","      <td>poor</td>\n","      <td>[need refurbishment stayed 7 nights, hotel gre...</td>\n","      <td>[0.04403533786535263, 0.02540094219148159, 0.0...</td>\n","      <td>need refurbishment stayed 7 nights, hotel grea...</td>\n","      <td>poor</td>\n","      <td>1031</td>\n","    </tr>\n","    <tr>\n","      <th>15</th>\n","      <td>0.657127</td>\n","      <td>excellent choice great experience little bouti...</td>\n","      <td>great</td>\n","      <td>[excellent choice great experience little bout...</td>\n","      <td>[0.039970774203538895, 0.02647809125483036, 0....</td>\n","      <td>excellent choice great experience little bouti...</td>\n","      <td>great</td>\n","      <td>6332</td>\n","    </tr>\n","    <tr>\n","      <th>16</th>\n","      <td>0.508753</td>\n","      <td>agree previous comments partner just returned ...</td>\n","      <td>average</td>\n","      <td>[agree previous comments partner just returned...</td>\n","      <td>[-0.021695351228117943, 0.05956907197833061, 0...</td>\n","      <td>agree previous comments partner just returned ...</td>\n","      <td>poor</td>\n","      <td>4253</td>\n","    </tr>\n","    <tr>\n","      <th>17</th>\n","      <td>0.615702</td>\n","      <td>stayed hurricane katrina hit stayed best suite...</td>\n","      <td>great</td>\n","      <td>[stayed hurricane katrina hit stayed best suit...</td>\n","      <td>[0.023213421925902367, 0.06629981100559235, 0....</td>\n","      <td>stayed hurricane katrina hit stayed best suite...</td>\n","      <td>great</td>\n","      <td>2886</td>\n","    </tr>\n","    <tr>\n","      <th>18</th>\n","      <td>0.937799</td>\n","      <td>rating suspect, hotel sofitel june 6-7 power w...</td>\n","      <td>poor</td>\n","      <td>[rating suspect, hotel sofitel june 6-7 power ...</td>\n","      <td>[0.02413402684032917, 0.024369465187191963, 0....</td>\n","      <td>rating suspect, hotel sofitel june 6-7 power w...</td>\n","      <td>poor</td>\n","      <td>5804</td>\n","    </tr>\n","    <tr>\n","      <th>19</th>\n","      <td>0.960582</td>\n","      <td>star nightmare, stayed hotels boston ranging 2...</td>\n","      <td>poor</td>\n","      <td>[star nightmare, stayed hotels boston ranging ...</td>\n","      <td>[0.03687024489045143, 0.04627574235200882, -0....</td>\n","      <td>star nightmare, stayed hotels boston ranging 2...</td>\n","      <td>poor</td>\n","      <td>1390</td>\n","    </tr>\n","    <tr>\n","      <th>20</th>\n","      <td>0.629762</td>\n","      <td>bargain budget conscious stayed diamond palace...</td>\n","      <td>poor</td>\n","      <td>[bargain budget conscious stayed diamond palac...</td>\n","      <td>[0.011012056842446327, 0.03317748382687569, 0....</td>\n","      <td>bargain budget conscious stayed diamond palace...</td>\n","      <td>poor</td>\n","      <td>6193</td>\n","    </tr>\n","    <tr>\n","      <th>21</th>\n","      <td>0.687743</td>\n","    
  <td>different wonderful place wife took quick trip...</td>\n","      <td>great</td>\n","      <td>[different wonderful place wife took quick tri...</td>\n","      <td>[-0.01771906204521656, 0.04389231652021408, -0...</td>\n","      <td>different wonderful place wife took quick trip...</td>\n","      <td>great</td>\n","      <td>4466</td>\n","    </tr>\n","    <tr>\n","      <th>22</th>\n","      <td>0.626671</td>\n","      <td>pleasant stay, not sure hotel getting high rat...</td>\n","      <td>average</td>\n","      <td>[pleasant stay, not sure hotel getting high ra...</td>\n","      <td>[0.05981390178203583, 0.05227690935134888, 0.0...</td>\n","      <td>pleasant stay, not sure hotel getting high rat...</td>\n","      <td>great</td>\n","      <td>5947</td>\n","    </tr>\n","    <tr>\n","      <th>23</th>\n","      <td>0.622244</td>\n","      <td>great location friendly staff stayed hotel eas...</td>\n","      <td>great</td>\n","      <td>[great location friendly staff stayed hotel ea...</td>\n","      <td>[0.04760528728365898, 0.057427771389484406, -0...</td>\n","      <td>great location friendly staff stayed hotel eas...</td>\n","      <td>great</td>\n","      <td>5984</td>\n","    </tr>\n","    <tr>\n","      <th>24</th>\n","      <td>0.757443</td>\n","      <td>indifferent reviewing hotel 4.5 star rating ex...</td>\n","      <td>average</td>\n","      <td>[indifferent reviewing hotel 4.5 star rating e...</td>\n","      <td>[0.03320983797311783, 0.018339596688747406, -0...</td>\n","      <td>indifferent reviewing hotel 4.5 star rating ex...</td>\n","      <td>poor</td>\n","      <td>2922</td>\n","    </tr>\n","    <tr>\n","      <th>25</th>\n","      <td>0.674891</td>\n","      <td>overpriced trendy hotel clean did provide expe...</td>\n","      <td>poor</td>\n","      <td>[overpriced trendy hotel clean did provide exp...</td>\n","      <td>[0.06268326193094254, 0.05868948996067047, -0....</td>\n","      <td>overpriced trendy hotel clean did provide expe...</td>\n","      <td>poor</td>\n","      <td>2431</td>\n","    </tr>\n","    <tr>\n","      <th>26</th>\n","      <td>0.789453</td>\n","      <td>nasty little hotel, hated, two-star hotel pass...</td>\n","      <td>poor</td>\n","      <td>[nasty little hotel, hated, two-star hotel pas...</td>\n","      <td>[0.056497860699892044, 0.03789917752146721, -0...</td>\n","      <td>nasty little hotel, hated, two-star hotel pass...</td>\n","      <td>poor</td>\n","      <td>1330</td>\n","    </tr>\n","    <tr>\n","      <th>27</th>\n","      <td>0.917323</td>\n","      <td>not horrible not great booked room priority cl...</td>\n","      <td>poor</td>\n","      <td>[not horrible not great booked room priority c...</td>\n","      <td>[0.05109235644340515, -0.002740392927080393, 0...</td>\n","      <td>not horrible not great booked room priority cl...</td>\n","      <td>poor</td>\n","      <td>3133</td>\n","    </tr>\n","    <tr>\n","      <th>28</th>\n","      <td>0.402551</td>\n","      <td>great location right price capital hotel large...</td>\n","      <td>average</td>\n","      <td>[great location right price capital hotel larg...</td>\n","      <td>[0.032248347997665405, 0.03452647104859352, 0....</td>\n","      <td>great location right price capital hotel large...</td>\n","      <td>great</td>\n","      <td>5813</td>\n","    </tr>\n","    <tr>\n","      <th>29</th>\n","      <td>0.447795</td>\n","      <td>glitches nice wife decided long weekend includ...</td>\n","      <td>average</td>\n","      <td>[glitches nice wife decided long 
weekend inclu...</td>\n","      <td>[0.032219260931015015, 0.05903381109237671, 0....</td>\n","      <td>glitches nice wife decided long weekend includ...</td>\n","      <td>great</td>\n","      <td>6006</td>\n","    </tr>\n","    <tr>\n","      <th>30</th>\n","      <td>0.629044</td>\n","      <td>best location beach, husband traveled honolulu...</td>\n","      <td>great</td>\n","      <td>[best location beach, husband traveled honolul...</td>\n","      <td>[0.023886611685156822, 0.05207139998674393, 0....</td>\n","      <td>best location beach, husband traveled honolulu...</td>\n","      <td>great</td>\n","      <td>1979</td>\n","    </tr>\n","    <tr>\n","      <th>31</th>\n","      <td>0.655476</td>\n","      <td>maison st charles highly recommended husband s...</td>\n","      <td>great</td>\n","      <td>[maison st charles highly recommended husband ...</td>\n","      <td>[0.029143501073122025, 0.05877670273184776, 0....</td>\n","      <td>maison st charles highly recommended husband s...</td>\n","      <td>great</td>\n","      <td>702</td>\n","    </tr>\n","    <tr>\n","      <th>32</th>\n","      <td>0.657997</td>\n","      <td>loved maison st charles quality inn time new o...</td>\n","      <td>great</td>\n","      <td>[loved maison st charles quality inn time new ...</td>\n","      <td>[-0.004672225099056959, 0.05431671440601349, 0...</td>\n","      <td>loved maison st charles quality inn time new o...</td>\n","      <td>great</td>\n","      <td>3571</td>\n","    </tr>\n","    <tr>\n","      <th>33</th>\n","      <td>0.658032</td>\n","      <td>just ok mariott condado beach relatively par e...</td>\n","      <td>average</td>\n","      <td>[just ok mariott condado beach relatively par ...</td>\n","      <td>[-0.026672502979636192, 0.06595038622617722, -...</td>\n","      <td>just ok mariott condado beach relatively par e...</td>\n","      <td>great</td>\n","      <td>1821</td>\n","    </tr>\n","    <tr>\n","      <th>34</th>\n","      <td>0.728509</td>\n","      <td>intercontential hotel good expectations not ex...</td>\n","      <td>poor</td>\n","      <td>[intercontential hotel good expectations not e...</td>\n","      <td>[-0.011664708144962788, 0.06257296353578568, 0...</td>\n","      <td>intercontential hotel good expectations not ex...</td>\n","      <td>poor</td>\n","      <td>3312</td>\n","    </tr>\n","    <tr>\n","      <th>35</th>\n","      <td>0.382764</td>\n","      <td>great facilities service lacklustre good thing...</td>\n","      <td>average</td>\n","      <td>[great facilities service lacklustre good thin...</td>\n","      <td>[0.06111598387360573, 0.043490465730428696, -0...</td>\n","      <td>great facilities service lacklustre good thing...</td>\n","      <td>poor</td>\n","      <td>3012</td>\n","    </tr>\n","    <tr>\n","      <th>36</th>\n","      <td>0.961093</td>\n","      <td>terrible place stay family miami fl plus coupl...</td>\n","      <td>poor</td>\n","      <td>[terrible place stay family miami fl plus coup...</td>\n","      <td>[-0.05480073392391205, 0.04913101717829704, 0....</td>\n","      <td>terrible place stay family miami fl plus coupl...</td>\n","      <td>poor</td>\n","      <td>720</td>\n","    </tr>\n","    <tr>\n","      <th>37</th>\n","      <td>0.591355</td>\n","      <td>hated hilton times square location good broadw...</td>\n","      <td>poor</td>\n","      <td>[hated hilton times square location good broad...</td>\n","      <td>[0.029626626521348953, 0.06088785454630852, 0....</td>\n","      <td>hated hilton times square location 
good broadw...</td>\n","      <td>great</td>\n","      <td>826</td>\n","    </tr>\n","    <tr>\n","      <th>38</th>\n","      <td>0.917227</td>\n","      <td>bad choice, booked hotel hot wire called immed...</td>\n","      <td>poor</td>\n","      <td>[bad choice, booked hotel hot wire called imme...</td>\n","      <td>[0.04970594495534897, 0.02396448515355587, 0.0...</td>\n","      <td>bad choice, booked hotel hot wire called immed...</td>\n","      <td>poor</td>\n","      <td>5010</td>\n","    </tr>\n","    <tr>\n","      <th>39</th>\n","      <td>0.960052</td>\n","      <td>miss hotel great location doubts recommending ...</td>\n","      <td>poor</td>\n","      <td>[miss hotel great location doubts recommending...</td>\n","      <td>[0.059027474373579025, 0.022951629012823105, 0...</td>\n","      <td>miss hotel great location doubts recommending ...</td>\n","      <td>poor</td>\n","      <td>6127</td>\n","    </tr>\n","    <tr>\n","      <th>40</th>\n","      <td>0.616965</td>\n","      <td>not bad certainly nothing great group rooms co...</td>\n","      <td>average</td>\n","      <td>[not bad certainly nothing great group rooms c...</td>\n","      <td>[0.03939132019877434, 0.018286321312189102, 0....</td>\n","      <td>not bad certainly nothing great group rooms co...</td>\n","      <td>great</td>\n","      <td>584</td>\n","    </tr>\n","    <tr>\n","      <th>41</th>\n","      <td>0.965633</td>\n","      <td>hotels came n.o, jan 23-27 2005. standard jacu...</td>\n","      <td>poor</td>\n","      <td>[hotels came n.o, jan 23-27 2005. standard jac...</td>\n","      <td>[-0.03817267715930939, 0.014341816306114197, -...</td>\n","      <td>hotels came n.o, jan 23-27 2005. standard jacu...</td>\n","      <td>poor</td>\n","      <td>1037</td>\n","    </tr>\n","    <tr>\n","      <th>42</th>\n","      <td>0.504387</td>\n","      <td>ok better stayed hotel november took mum birth...</td>\n","      <td>average</td>\n","      <td>[ok better stayed hotel november took mum birt...</td>\n","      <td>[0.055742740631103516, 0.051377054303884506, 0...</td>\n","      <td>ok better stayed hotel november took mum birth...</td>\n","      <td>poor</td>\n","      <td>2375</td>\n","    </tr>\n","    <tr>\n","      <th>43</th>\n","      <td>0.385632</td>\n","      <td>3rd time stayed vintage park 8 years ago, kimp...</td>\n","      <td>great</td>\n","      <td>[3rd time stayed vintage park 8 years ago, kim...</td>\n","      <td>[0.05347459018230438, 0.05611817166209221, 0.0...</td>\n","      <td>3rd time stayed vintage park 8 years ago, kimp...</td>\n","      <td>poor</td>\n","      <td>3374</td>\n","    </tr>\n","    <tr>\n","      <th>44</th>\n","      <td>0.659771</td>\n","      <td>good bang buck winter 2003 hotel barcelo bavar...</td>\n","      <td>average</td>\n","      <td>[good bang buck winter 2003 hotel barcelo bava...</td>\n","      <td>[-0.02383992448449135, 0.053295351564884186, -...</td>\n","      <td>good bang buck winter 2003 hotel barcelo bavar...</td>\n","      <td>great</td>\n","      <td>3655</td>\n","    </tr>\n","    <tr>\n","      <th>45</th>\n","      <td>0.545162</td>\n","      <td>excellent local hotel chain did n't know expec...</td>\n","      <td>great</td>\n","      <td>[excellent local hotel chain did n't know expe...</td>\n","      <td>[0.04762573167681694, 0.0546991229057312, -0.0...</td>\n","      <td>excellent local hotel chain did n't know expec...</td>\n","      <td>great</td>\n","      <td>5079</td>\n","    </tr>\n","    <tr>\n","      <th>46</th>\n","      
<td>0.950085</td>\n","      <td>outta fast, place complete dump, wish read rev...</td>\n","      <td>poor</td>\n","      <td>[outta fast, place complete dump, wish read re...</td>\n","      <td>[0.025958692654967308, 0.01543063297867775, -0...</td>\n","      <td>outta fast, place complete dump, wish read rev...</td>\n","      <td>poor</td>\n","      <td>2348</td>\n","    </tr>\n","    <tr>\n","      <th>47</th>\n","      <td>0.584074</td>\n","      <td>good location 5 girls stayed crest suite 1 bed...</td>\n","      <td>great</td>\n","      <td>[good location 5 girls stayed crest suite 1 be...</td>\n","      <td>[0.0578380823135376, 0.0559656023979187, 0.006...</td>\n","      <td>good location 5 girls stayed crest suite 1 bed...</td>\n","      <td>great</td>\n","      <td>5292</td>\n","    </tr>\n","    <tr>\n","      <th>48</th>\n","      <td>0.940189</td>\n","      <td>avoid just wanted say went hotel family 4. pla...</td>\n","      <td>poor</td>\n","      <td>[avoid just wanted say went hotel family 4. pl...</td>\n","      <td>[0.015874499455094337, 0.06213773414492607, -0...</td>\n","      <td>avoid just wanted say went hotel family 4. pla...</td>\n","      <td>poor</td>\n","      <td>1497</td>\n","    </tr>\n","    <tr>\n","      <th>49</th>\n","      <td>0.636614</td>\n","      <td>believe, awesome stay royal service, wife stay...</td>\n","      <td>great</td>\n","      <td>[believe, awesome stay royal service, wife sta...</td>\n","      <td>[-0.034960873425006866, 0.03390227630734444, -...</td>\n","      <td>believe, awesome stay royal service, wife stay...</td>\n","      <td>great</td>\n","      <td>3213</td>\n","    </tr>\n","  </tbody>\n","</table>\n","</div>"],"text/plain":["    trained_classifier_confidence_confidence  ... origin_index\n","0                                   0.545426  ...          577\n","1                                   0.697417  ...         5746\n","2                                   0.671591  ...          675\n","3                                   0.969960  ...         4415\n","4                                   0.668067  ...         4099\n","5                                   0.516283  ...         2621\n","6                                   0.484881  ...         2165\n","7                                   0.648219  ...         2522\n","8                                   0.561468  ...         5187\n","9                                   0.822802  ...          488\n","10                                  0.434855  ...         1944\n","11                                  0.639247  ...         4809\n","12                                  0.912950  ...         1825\n","13                                  0.607228  ...         1663\n","14                                  0.822926  ...         1031\n","15                                  0.657127  ...         6332\n","16                                  0.508753  ...         4253\n","17                                  0.615702  ...         2886\n","18                                  0.937799  ...         5804\n","19                                  0.960582  ...         1390\n","20                                  0.629762  ...         6193\n","21                                  0.687743  ...         4466\n","22                                  0.626671  ...         5947\n","23                                  0.622244  ...         5984\n","24                                  0.757443  ...         2922\n","25                                  0.674891  ...         
2431\n","26                                  0.789453  ...         1330\n","27                                  0.917323  ...         3133\n","28                                  0.402551  ...         5813\n","29                                  0.447795  ...         6006\n","30                                  0.629044  ...         1979\n","31                                  0.655476  ...          702\n","32                                  0.657997  ...         3571\n","33                                  0.658032  ...         1821\n","34                                  0.728509  ...         3312\n","35                                  0.382764  ...         3012\n","36                                  0.961093  ...          720\n","37                                  0.591355  ...          826\n","38                                  0.917227  ...         5010\n","39                                  0.960052  ...         6127\n","40                                  0.616965  ...          584\n","41                                  0.965633  ...         1037\n","42                                  0.504387  ...         2375\n","43                                  0.385632  ...         3374\n","44                                  0.659771  ...         3655\n","45                                  0.545162  ...         5079\n","46                                  0.950085  ...         2348\n","47                                  0.584074  ...         5292\n","48                                  0.940189  ...         1497\n","49                                  0.636614  ...         3213\n","\n","[50 rows x 8 columns]"]},"metadata":{"tags":[]},"execution_count":4}]},{"cell_type":"markdown","metadata":{"id":"lVyOE2wV0fw_"},"source":["#4. Test the fitted pipe on new example"]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/","height":80},"id":"qdCUg2MR0PD2","executionInfo":{"status":"ok","timestamp":1620191368924,"user_tz":-300,"elapsed":258538,"user":{"displayName":"Gammer Otaku","photoUrl":"","userId":"18042713576744284398"}},"outputId":"5fe7ab38-19fd-469c-8d84-a5d31a4b86cf"},"source":["fitted_pipe.predict(\"It was a good experince!\")"],"execution_count":null,"outputs":[{"output_type":"execute_result","data":{"text/html":["<div>\n","<style scoped>\n","    .dataframe tbody tr th:only-of-type {\n","        vertical-align: middle;\n","    }\n","\n","    .dataframe tbody tr th {\n","        vertical-align: top;\n","    }\n","\n","    .dataframe thead th {\n","        text-align: right;\n","    }\n","</style>\n","<table border=\"1\" class=\"dataframe\">\n","  <thead>\n","    <tr style=\"text-align: right;\">\n","      <th></th>\n","      <th>trained_classifier_confidence_confidence</th>\n","      <th>sentence</th>\n","      <th>sentence_embedding_use</th>\n","      <th>document</th>\n","      <th>trained_classifier</th>\n","      <th>origin_index</th>\n","    </tr>\n","  </thead>\n","  <tbody>\n","    <tr>\n","      <th>0</th>\n","      <td>0.549657</td>\n","      <td>[It was a good experince!]</td>\n","      <td>[0.034853726625442505, 0.018303068354725838, -...</td>\n","      <td>It was a good experince!</td>\n","      <td>great</td>\n","      <td>0</td>\n","    </tr>\n","  </tbody>\n","</table>\n","</div>"],"text/plain":["   trained_classifier_confidence_confidence  ... origin_index\n","0                                  0.549657  ...            
0\n","\n","[1 rows x 6 columns]"]},"metadata":{"tags":[]},"execution_count":5}]},{"cell_type":"markdown","metadata":{"id":"xflpwrVjjBVD"},"source":["## Configure pipe training parameters"]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"UtsAUGTmOTms","executionInfo":{"status":"ok","timestamp":1620191368985,"user_tz":-300,"elapsed":258592,"user":{"displayName":"Gammer Otaku","photoUrl":"","userId":"18042713576744284398"}},"outputId":"316005d7-a943-4c86-8921-1c403304f5bd"},"source":["trainable_pipe.print_info()"],"execution_count":null,"outputs":[{"output_type":"stream","text":["The following parameters are configurable for this NLU pipeline (You can copy paste the examples) :\n",">>> pipe['classifier_dl'] has settable params:\n","pipe['classifier_dl'].setMaxEpochs(3)                | Info: Maximum number of epochs to train | Currently set to : 3\n","pipe['classifier_dl'].setLr(0.005)                   | Info: Learning Rate | Currently set to : 0.005\n","pipe['classifier_dl'].setBatchSize(64)               | Info: Batch size | Currently set to : 64\n","pipe['classifier_dl'].setDropout(0.5)                | Info: Dropout coefficient | Currently set to : 0.5\n","pipe['classifier_dl'].setEnableOutputLogs(True)      | Info: Whether to use stdout in addition to Spark logs. | Currently set to : True\n",">>> pipe['use@tfhub_use'] has settable params:\n","pipe['use@tfhub_use'].setDimension(512)              | Info: Number of embedding dimensions | Currently set to : 512\n","pipe['use@tfhub_use'].setLoadSP(False)               | Info: Whether to load SentencePiece ops file which is required only by multi-lingual models. This is not changeable after it's set with a pretrained model nor it is compatible with Windows. | Currently set to : False\n","pipe['use@tfhub_use'].setStorageRef('tfhub_use')     | Info: unique reference name for identification | Currently set to : tfhub_use\n",">>> pipe['deep_sentence_detector@SentenceDetectorDLModel_c83c27f46b97'] has settable params:\n","pipe['deep_sentence_detector@SentenceDetectorDLModel_c83c27f46b97'].setExplodeSentences(False)  | Info: whether to explode each sentence into a different row, for better parallelization. Defaults to false. 
| Currently set to : False\n","pipe['deep_sentence_detector@SentenceDetectorDLModel_c83c27f46b97'].setStorageRef('SentenceDetectorDLModel_c83c27f46b97')  | Info: storage unique identifier | Currently set to : SentenceDetectorDLModel_c83c27f46b97\n","pipe['deep_sentence_detector@SentenceDetectorDLModel_c83c27f46b97'].setEncoder(com.johnsnowlabs.nlp.annotators.sentence_detector_dl.SentenceDetectorDLEncoder@113937db)  | Info: Data encoder | Currently set to : com.johnsnowlabs.nlp.annotators.sentence_detector_dl.SentenceDetectorDLEncoder@113937db\n","pipe['deep_sentence_detector@SentenceDetectorDLModel_c83c27f46b97'].setImpossiblePenultimates(['Bros', 'No', 'al', 'vs', 'etc', 'Fig', 'Dr', 'Prof', 'PhD', 'MD', 'Co', 'Corp', 'Inc', 'bros', 'VS', 'Vs', 'ETC', 'fig', 'dr', 'prof', 'PHD', 'phd', 'md', 'co', 'corp', 'inc', 'Jan', 'Feb', 'Mar', 'Apr', 'Jul', 'Aug', 'Sep', 'Sept', 'Oct', 'Nov', 'Dec', 'St', 'st', 'AM', 'PM', 'am', 'pm', 'e.g', 'f.e', 'i.e'])  | Info: Impossible penultimates | Currently set to : ['Bros', 'No', 'al', 'vs', 'etc', 'Fig', 'Dr', 'Prof', 'PhD', 'MD', 'Co', 'Corp', 'Inc', 'bros', 'VS', 'Vs', 'ETC', 'fig', 'dr', 'prof', 'PHD', 'phd', 'md', 'co', 'corp', 'inc', 'Jan', 'Feb', 'Mar', 'Apr', 'Jul', 'Aug', 'Sep', 'Sept', 'Oct', 'Nov', 'Dec', 'St', 'st', 'AM', 'PM', 'am', 'pm', 'e.g', 'f.e', 'i.e']\n","pipe['deep_sentence_detector@SentenceDetectorDLModel_c83c27f46b97'].setModelArchitecture('cnn')  | Info: Model architecture (CNN) | Currently set to : cnn\n",">>> pipe['document_assembler'] has settable params:\n","pipe['document_assembler'].setCleanupMode('shrink')  | Info: possible values: disabled, inplace, inplace_full, shrink, shrink_full, each, each_full, delete_full | Currently set to : shrink\n"],"name":"stdout"}]},
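{"cell_type":"markdown","metadata":{},"source":["Any of the parameters listed above can be set before fitting. A minimal sketch with illustrative, untuned values; note that depending on the pipeline the component key may be 'classifier_dl' or 'trainable_classifier_dl', as the print_info() output and the cells below show:"]},{"cell_type":"code","metadata":{},"source":["# Illustrative, untuned settings taken from the print_info() listing above\n","trainable_pipe['classifier_dl'].setMaxEpochs(10)   # train for more epochs\n","trainable_pipe['classifier_dl'].setLr(0.001)       # lower the learning rate\n","trainable_pipe['classifier_dl'].setBatchSize(32)   # smaller batches\n","trainable_pipe['classifier_dl'].setDropout(0.5)    # keep the default dropout"],"execution_count":null,"outputs":[]},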
{"cell_type":"markdown","metadata":{"id":"2GJdDNV9jEIe"},"source":["# 5. Retrain with new parameters"]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/","height":759},"id":"mptfvHx-MMMX","executionInfo":{"status":"ok","timestamp":1620191380476,"user_tz":-300,"elapsed":270072,"user":{"displayName":"Gammer Otaku","photoUrl":"","userId":"18042713576744284398"}},"outputId":"6d96e2b0-8c38-4495-9ad5-c8a0abeaa7b0"},"source":["# Train longer!\n","trainable_pipe = nlu.load('train.classifier')\n","trainable_pipe['trainable_classifier_dl'].setMaxEpochs(5)\n","fitted_pipe = trainable_pipe.fit(train_df.iloc[:100])\n","# predict with the fitted pipeline on the dataset and get predictions\n","preds = fitted_pipe.predict(train_df.iloc[:100], output_level='document')\n","\n","# the sentence detector that is part of the pipe generates some NaNs; let's drop them first\n","preds.dropna(inplace=True)\n","from sklearn.metrics import classification_report\n","print(classification_report(preds['y'], preds['classifier_dl']))\n","preds"],"execution_count":null,"outputs":[{"output_type":"stream","text":["              precision    recall  f1-score   support\n","\n","     average       0.67      0.35      0.46        34\n","       great       0.73      0.94      0.82        34\n","        poor       0.66      0.78      0.71        32\n","\n","    accuracy                           0.69       100\n","   macro avg       0.68      0.69      0.67       100\n","weighted avg       0.68      0.69      0.66       100\n","\n"],"name":"stdout"},{"output_type":"execute_result","data":{"text/html":["<div>\n","<style scoped>\n","    .dataframe tbody tr th:only-of-type {\n","        vertical-align: middle;\n","    }\n","\n","    .dataframe tbody tr th {\n","        vertical-align: top;\n","    }\n","\n","    .dataframe thead th {\n","        text-align: right;\n","    }\n","</style>\n","<table border=\"1\" class=\"dataframe\">\n","  <thead>\n","    <tr style=\"text-align: right;\">\n","      <th></th>\n","      <th>trained_classifier_confidence_confidence</th>\n","      <th>text</th>\n","      <th>y</th>\n","      <th>sentence</th>\n","      <th>sentence_embedding_use</th>\n","      <th>document</th>\n","      <th>trained_classifier</th>\n","      <th>origin_index</th>\n","    </tr>\n","  </thead>\n","  <tbody>\n","    <tr>\n","      <th>0</th>\n","      <td>0.590834</td>\n","      <td>decent hotel decent price stayed 5 nights delu...</td>\n","      <td>average</td>\n","      <td>[decent hotel decent price stayed 5 nights del...</td>\n","      <td>[0.053218383342027664, 0.04507320374250412, 0....</td>\n","      <td>decent hotel decent price stayed 5 nights delu...</td>\n","      <td>poor</td>\n","      <td>577</td>\n","    </tr>\n","    <tr>\n","      <th>1</th>\n","      <td>0.993803</td>\n","      <td>good said previous posts small tasteful renova...</td>\n","      <td>great</td>\n","      <td>[good said previous posts small tasteful renov...</td>\n","      <td>[0.039876967668533325, 0.06624795496463776, -0...</td>\n","      <td>good said previous posts small tasteful renova...</td>\n","      <td>great</td>\n","      <td>5746</td>\n","    </tr>\n","    <tr>\n","      <th>2</th>\n","      <td>0.981204</td>\n","      <td>gold floor best stayed gold floor club floor f...</td>\n","      <td>great</td>\n","      <td>[gold floor best stayed gold floor club floor ...</td>\n","      <td>[0.0038577697705477476, 0.05996308475732803, -...</td>\n","      <td>gold floor best stayed gold floor club floor f...</td>\n","      <td>great</td>\n","      <td>675</td>\n","    </tr>\n","    <tr>\n","      <th>3</th>\n","      <td>0.880106</td>\n","      <td>truly awful, admit slightly sceptical arriving...</td>\n","      <td>poor</td>\n","      <td>[truly awful, admit slightly sceptical arrivin...</td>\n","      <td>[0.06336931884288788, 0.0006446511833928525, -...</td>\n","      <td>truly awful, admit slightly sceptical arriving...</td>\n","      <td>poor</td>\n","      <td>4415</td>\n","    </tr>\n","    <tr>\n","      <th>4</th>\n","      <td>0.989036</td>\n","      <td>union square jewel loved hotel ammenities loca...</td>\n","      <td>great</td>\n","      <td>[union square jewel loved hotel ammenities loc...</td>\n","      <td>[0.025744924321770668, 0.06057509407401085, 0....</td>\n","      <td>union square jewel loved hotel ammenities loca...</td>\n",
<td>great</td>\n","      <td>4099</td>\n","    </tr>\n","    <tr>\n","      <th>...</th>\n","      <td>...</td>\n","      <td>...</td>\n","      <td>...</td>\n","      <td>...</td>\n","      <td>...</td>\n","      <td>...</td>\n","      <td>...</td>\n","      <td>...</td>\n","    </tr>\n","    <tr>\n","      <th>95</th>\n","      <td>0.487027</td>\n","      <td>nice hotel needs little updating, good price, ...</td>\n","      <td>average</td>\n","      <td>[nice hotel needs little updating, good price,...</td>\n","      <td>[0.034561637789011, 0.05367675796151161, 0.012...</td>\n","      <td>nice hotel needs little updating, good price, ...</td>\n","      <td>great</td>\n","      <td>310</td>\n","    </tr>\n","    <tr>\n","      <th>96</th>\n","      <td>0.602283</td>\n","      <td>nice n't just returned stay macao feb 10-16/08...</td>\n","      <td>average</td>\n","      <td>[nice n't just returned stay macao feb 10-16/0...</td>\n","      <td>[-0.027111494913697243, 0.05856683477759361, 0...</td>\n","      <td>nice n't just returned stay macao feb 10-16/08...</td>\n","      <td>average</td>\n","      <td>5581</td>\n","    </tr>\n","    <tr>\n","      <th>97</th>\n","      <td>0.662033</td>\n","      <td>15-22, fiance went club carabela 15 22. 23 fia...</td>\n","      <td>average</td>\n","      <td>[15-22, fiance went club carabela 15 22. 23 fi...</td>\n","      <td>[-0.04782567545771599, 0.04917208105325699, 0....</td>\n","      <td>15-22, fiance went club carabela 15 22. 23 fia...</td>\n","      <td>average</td>\n","      <td>4554</td>\n","    </tr>\n","    <tr>\n","      <th>98</th>\n","      <td>0.876396</td>\n","      <td>quaint not rundown son decided celebrate gradu...</td>\n","      <td>poor</td>\n","      <td>[quaint not rundown son decided celebrate grad...</td>\n","      <td>[0.03745331987738609, 0.04204617813229561, -0....</td>\n","      <td>quaint not rundown son decided celebrate gradu...</td>\n","      <td>poor</td>\n","      <td>6498</td>\n","    </tr>\n","    <tr>\n","      <th>99</th>\n","      <td>0.740084</td>\n","      <td>overwhelming stay oct 19 26thnegatives arrival...</td>\n","      <td>average</td>\n","      <td>[overwhelming stay oct 19 26thnegatives arriva...</td>\n","      <td>[-0.0396517813205719, 0.017166994512081146, 0....</td>\n","      <td>overwhelming stay oct 19 26thnegatives arrival...</td>\n","      <td>poor</td>\n","      <td>5651</td>\n","    </tr>\n","  </tbody>\n","</table>\n","<p>100 rows × 8 columns</p>\n","</div>"],"text/plain":["    trained_classifier_confidence_confidence  ... origin_index\n","0                                   0.590834  ...          577\n","1                                   0.993803  ...         5746\n","2                                   0.981204  ...          675\n","3                                   0.880106  ...         4415\n","4                                   0.989036  ...         4099\n","..                                       ...  ...          ...\n","95                                  0.487027  ...          310\n","96                                  0.602283  ...         5581\n","97                                  0.662033  ...         4554\n","98                                  0.876396  ...         6498\n","99                                  0.740084  ...         5651\n","\n","[100 rows x 8 columns]"]},"metadata":{"tags":[]},"execution_count":7}]},{"cell_type":"markdown","metadata":{"id":"qFoT-s1MjTSS"},"source":["#6. 
{"cell_type":"markdown","metadata":{"id":"qFoT-s1MjTSS"},"source":["# 6. Try training with different embeddings"]},{"cell_type":"code","metadata":{"id":"nxWFzQOhjWC8","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1620191380992,"user_tz":-300,"elapsed":270578,"user":{"displayName":"Gammer Otaku","photoUrl":"","userId":"18042713576744284398"}},"outputId":"b7c7fdd6-b46c-4b82-9440-c7c2bcf87c64"},"source":["# We can use nlu.print_components(action='embed_sentence') to see every possible sentence embedding we could use. Let's use BERT!\n","nlu.print_components(action='embed_sentence')"],"execution_count":null,"outputs":[{"output_type":"stream","text":["For language <en> NLU provides the following Models : \n","nlu.load('en.embed_sentence') returns Spark NLP model tfhub_use\n","nlu.load('en.embed_sentence.use') returns Spark NLP model tfhub_use\n","nlu.load('en.embed_sentence.tfhub_use') returns Spark NLP model tfhub_use\n","nlu.load('en.embed_sentence.use.lg') returns Spark NLP model tfhub_use_lg\n","nlu.load('en.embed_sentence.tfhub_use.lg') returns Spark NLP model tfhub_use_lg\n","nlu.load('en.embed_sentence.albert') returns Spark NLP model albert_base_uncased\n","nlu.load('en.embed_sentence.electra') returns Spark NLP model sent_electra_small_uncased\n","nlu.load('en.embed_sentence.electra_small_uncased') returns Spark NLP model sent_electra_small_uncased\n","nlu.load('en.embed_sentence.electra_base_uncased') returns Spark NLP model sent_electra_base_uncased\n","nlu.load('en.embed_sentence.electra_large_uncased') returns Spark NLP model sent_electra_large_uncased\n","nlu.load('en.embed_sentence.bert') returns Spark NLP model sent_bert_base_uncased\n","nlu.load('en.embed_sentence.bert_base_uncased') returns Spark NLP model sent_bert_base_uncased\n","nlu.load('en.embed_sentence.bert_base_cased') returns Spark NLP model sent_bert_base_cased\n","nlu.load('en.embed_sentence.bert_large_uncased') returns Spark NLP model sent_bert_large_uncased\n","nlu.load('en.embed_sentence.bert_large_cased') returns Spark NLP model sent_bert_large_cased\n","nlu.load('en.embed_sentence.biobert.pubmed_base_cased') returns Spark NLP model sent_biobert_pubmed_base_cased\n","nlu.load('en.embed_sentence.biobert.pubmed_large_cased') returns Spark NLP model sent_biobert_pubmed_large_cased\n","nlu.load('en.embed_sentence.biobert.pmc_base_cased') returns Spark NLP model sent_biobert_pmc_base_cased\n","nlu.load('en.embed_sentence.biobert.pubmed_pmc_base_cased') returns Spark NLP model sent_biobert_pubmed_pmc_base_cased\n","nlu.load('en.embed_sentence.biobert.clinical_base_cased') returns Spark NLP model sent_biobert_clinical_base_cased\n","nlu.load('en.embed_sentence.biobert.discharge_base_cased') returns Spark NLP model sent_biobert_discharge_base_cased\n","nlu.load('en.embed_sentence.covidbert.large_uncased') returns Spark NLP model sent_covidbert_large_uncased\n","nlu.load('en.embed_sentence.small_bert_L2_128') returns Spark NLP model sent_small_bert_L2_128\n","nlu.load('en.embed_sentence.small_bert_L4_128') returns Spark NLP model sent_small_bert_L4_128\n","nlu.load('en.embed_sentence.small_bert_L6_128') returns Spark NLP model sent_small_bert_L6_128\n","nlu.load('en.embed_sentence.small_bert_L8_128') returns Spark NLP model sent_small_bert_L8_128\n","nlu.load('en.embed_sentence.small_bert_L10_128') returns Spark NLP model sent_small_bert_L10_128\n","nlu.load('en.embed_sentence.small_bert_L12_128') returns Spark NLP model sent_small_bert_L12_128\n","nlu.load('en.embed_sentence.small_bert_L2_256') returns Spark NLP model sent_small_bert_L2_256\n",
"nlu.load('en.embed_sentence.small_bert_L4_256') returns Spark NLP model sent_small_bert_L4_256\n","nlu.load('en.embed_sentence.small_bert_L6_256') returns Spark NLP model sent_small_bert_L6_256\n","nlu.load('en.embed_sentence.small_bert_L8_256') returns Spark NLP model sent_small_bert_L8_256\n","nlu.load('en.embed_sentence.small_bert_L10_256') returns Spark NLP model sent_small_bert_L10_256\n","nlu.load('en.embed_sentence.small_bert_L12_256') returns Spark NLP model sent_small_bert_L12_256\n","nlu.load('en.embed_sentence.small_bert_L2_512') returns Spark NLP model sent_small_bert_L2_512\n","nlu.load('en.embed_sentence.small_bert_L4_512') returns Spark NLP model sent_small_bert_L4_512\n","nlu.load('en.embed_sentence.small_bert_L6_512') returns Spark NLP model sent_small_bert_L6_512\n","nlu.load('en.embed_sentence.small_bert_L8_512') returns Spark NLP model sent_small_bert_L8_512\n","nlu.load('en.embed_sentence.small_bert_L10_512') returns Spark NLP model sent_small_bert_L10_512\n","nlu.load('en.embed_sentence.small_bert_L12_512') returns Spark NLP model sent_small_bert_L12_512\n","nlu.load('en.embed_sentence.small_bert_L2_768') returns Spark NLP model sent_small_bert_L2_768\n","nlu.load('en.embed_sentence.small_bert_L4_768') returns Spark NLP model sent_small_bert_L4_768\n","nlu.load('en.embed_sentence.small_bert_L6_768') returns Spark NLP model sent_small_bert_L6_768\n","nlu.load('en.embed_sentence.small_bert_L8_768') returns Spark NLP model sent_small_bert_L8_768\n","nlu.load('en.embed_sentence.small_bert_L10_768') returns Spark NLP model sent_small_bert_L10_768\n","nlu.load('en.embed_sentence.small_bert_L12_768') returns Spark NLP model sent_small_bert_L12_768\n","For language <fi> NLU provides the following Models : \n","nlu.load('fi.embed_sentence') returns Spark NLP model sent_bert_finnish_cased\n","nlu.load('fi.embed_sentence.bert.cased') returns Spark NLP model sent_bert_finnish_cased\n","nlu.load('fi.embed_sentence.bert.uncased') returns Spark NLP model sent_bert_finnish_uncased\n","For language <xx> NLU provides the following Models : \n","nlu.load('xx.embed_sentence') returns Spark NLP model sent_bert_multi_cased\n","nlu.load('xx.embed_sentence.bert') returns Spark NLP model sent_bert_multi_cased\n","nlu.load('xx.embed_sentence.bert.cased') returns Spark NLP model sent_bert_multi_cased\n","nlu.load('xx.embed_sentence.labse') returns Spark NLP model labse\n"],"name":"stdout"}]},
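{"cell_type":"markdown","metadata":{},"source":["The next cell trains with sent_small_bert_L12_768, which downloads roughly 393 MB and trains for 90 epochs. For a quicker experiment, any of the smaller BERTs listed above follow the same pattern; a minimal sketch with the smallest one (illustrative, untuned):"]},{"cell_type":"code","metadata":{},"source":["# Same train.classifier pattern with the smallest listed BERT sentence embedding\n","quick_pipe = nlu.load('en.embed_sentence.small_bert_L2_128 train.classifier')\n","quick_pipe['trainable_classifier_dl'].setMaxEpochs(10)\n","quick_fitted = quick_pipe.fit(train_df.iloc[:200])\n","quick_fitted.predict('great location friendly staff')"],"execution_count":null,"outputs":[]},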
{"cell_type":"code","metadata":{"id":"IKK_Ii_gjJfF","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1620201488264,"user_tz":-300,"elapsed":6142302,"user":{"displayName":"Gammer Otaku","photoUrl":"","userId":"18042713576744284398"}},"outputId":"d071f07a-afcf-454f-87b5-9ce4c0e8bf86"},"source":["trainable_pipe = nlu.load('en.embed_sentence.small_bert_L12_768 train.classifier')\n","# Non-USE sentence embeddings usually need longer training and a smaller learning rate\n","# We could tune the hyperparameters further with methods like grid search\n","# Longer training usually also improves accuracy\n","trainable_pipe['trainable_classifier_dl'].setMaxEpochs(90)\n","trainable_pipe['trainable_classifier_dl'].setLr(0.0005)\n","fitted_pipe = trainable_pipe.fit(train_df)\n","# predict with the fitted pipeline on the dataset and get predictions\n","preds = fitted_pipe.predict(train_df, output_level='document')\n","\n","# the sentence detector that is part of the pipe generates some NaNs; let's drop them first\n","preds.dropna(inplace=True)\n","print(classification_report(preds['y'], preds['classifier_dl']))\n","\n","#preds"],"execution_count":null,"outputs":[{"output_type":"stream","text":["sent_small_bert_L12_768 download started this may take some time.\n","Approximate size to download 392.9 MB\n","[OK!]\n","sentence_detector_dl download started this may take some time.\n","Approximate size to download 354.6 KB\n","[OK!]\n","              precision    recall  f1-score   support\n","\n","     average       0.67      0.61      0.64      1758\n","       great       0.77      0.84      0.80      1727\n","        poor       0.77      0.77      0.77      1756\n","\n","    accuracy                           0.74      5241\n","   macro avg       0.74      0.74      0.74      5241\n","weighted avg       0.74      0.74      0.74      5241\n","\n"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"_1jxw3GnVGlI"},"source":["# 7. Evaluate on test data"]},{"cell_type":"code","metadata":{"id":"Fxx4yNkNVGFl","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1620201993037,"user_tz":-300,"elapsed":504888,"user":{"displayName":"Gammer Otaku","photoUrl":"","userId":"18042713576744284398"}},"outputId":"8a4eedf0-5a5d-496a-c0fa-7968fa030f1c"},"source":["preds = fitted_pipe.predict(test_df, output_level='document')\n","\n","# the sentence detector that is part of the pipe generates some NaNs; let's drop them first\n","preds.dropna(inplace=True)\n","print(classification_report(preds['y'], preds['classifier_dl']))"],"execution_count":null,"outputs":[{"output_type":"stream","text":["              precision    recall  f1-score   support\n","\n","     average       0.61      0.65      0.63       426\n","       great       0.78      0.78      0.78       457\n","        poor       0.78      0.73      0.75       428\n","\n","    accuracy                           0.72      1311\n","   macro avg       0.72      0.72      0.72      1311\n","weighted avg       0.72      0.72      0.72      1311\n","\n"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"2BB-NwZUoHSe"},"source":["# 8. Let's save the model"]},{"cell_type":"code","metadata":{"id":"eLex095goHwm","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1620202238341,"user_tz":-300,"elapsed":245312,"user":{"displayName":"Gammer Otaku","photoUrl":"","userId":"18042713576744284398"}},"outputId":"5d08a028-29d1-48db-858d-74c162d89b69"},"source":["stored_model_path = './models/classifier_dl_trained'\n","fitted_pipe.save(stored_model_path)"],"execution_count":null,"outputs":[{"output_type":"stream","text":["Stored model in ./models/classifier_dl_trained\n"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"e_b2DPd4rCiU"},"source":["# 9. Let's load the model from disk\n","This makes offline NLU usage possible!
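 As a quick check, you can list what the save call wrote to disk (a minimal sketch; the exact files depend on the Spark NLP version):\n","\n","```python\n","import os\n","os.listdir('./models/classifier_dl_trained')  # path from step 8\n","```\n","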
{"cell_type":"markdown","metadata":{"id":"2BB-NwZUoHSe"},"source":["# 8. Let's save the model"]},{"cell_type":"code","metadata":{"id":"eLex095goHwm","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1620202238341,"user_tz":-300,"elapsed":245312,"user":{"displayName":"Gammer Otaku","photoUrl":"","userId":"18042713576744284398"}},"outputId":"5d08a028-29d1-48db-858d-74c162d89b69"},"source":["stored_model_path = './models/classifier_dl_trained'\n","fitted_pipe.save(stored_model_path)"],"execution_count":null,"outputs":[{"output_type":"stream","text":["Stored model in ./models/classifier_dl_trained\n"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"e_b2DPd4rCiU"},"source":["# 9. Let's load the model from disk\n","This makes offline NLU usage possible!\n","You need to call nlu.load(path=path_to_the_pipe) to load a model/pipeline from disk."]},{"cell_type":"code","metadata":{"id":"SO4uz45MoRgp","colab":{"base_uri":"https://localhost:8080/","height":80},"executionInfo":{"status":"ok","timestamp":1620202254043,"user_tz":-300,"elapsed":15721,"user":{"displayName":"Gammer Otaku","photoUrl":"","userId":"18042713576744284398"}},"outputId":"a0c1f0d3-e6a8-4dd6-f926-284545527afd"},"source":["hdd_pipe = nlu.load(path=stored_model_path)\n","\n","preds = hdd_pipe.predict('It was a good experience!')\n","preds"],"execution_count":null,"outputs":[{"output_type":"execute_result","data":{"text/html":["<div>\n","<style scoped>\n","    .dataframe tbody tr th:only-of-type {\n","        vertical-align: middle;\n","    }\n","\n","    .dataframe tbody tr th {\n","        vertical-align: top;\n","    }\n","\n","    .dataframe thead th {\n","        text-align: right;\n","    }\n","</style>\n","<table border=\"1\" class=\"dataframe\">\n","  <thead>\n","    <tr style=\"text-align: right;\">\n","      <th></th>\n","      <th>from_disk_confidence_confidence</th>\n","      <th>text</th>\n","      <th>sentence</th>\n","      <th>sentence_embedding_from_disk</th>\n","      <th>from_disk</th>\n","      <th>document</th>\n","      <th>origin_index</th>\n","    </tr>\n","  </thead>\n","  <tbody>\n","    <tr>\n","      <th>0</th>\n","      <td>[0.89922845]</td>\n","      <td>It was a good experience!</td>\n","      <td>[It was a good experience!]</td>\n","      <td>[[-0.1282019317150116, 0.30001381039619446, 0....</td>\n","      <td>[great]</td>\n","      <td>It was a good experience!</td>\n","      <td>8589934592</td>\n","    </tr>\n","  </tbody>\n","</table>\n","</div>"],"text/plain":["  from_disk_confidence_confidence  ... origin_index\n","0                    [0.89922845]  ...   8589934592\n","\n","[1 rows x 7 columns]"]},"metadata":{"tags":[]},"execution_count":12}]},{"cell_type":"code","metadata":{"id":"e0CVlkk9v6Qi","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1620202254063,"user_tz":-300,"elapsed":63,"user":{"displayName":"Gammer Otaku","photoUrl":"","userId":"18042713576744284398"}},"outputId":"d39ac458-5cba-4e4d-b0f2-3b4b0c088c30"},"source":["hdd_pipe.print_info()"],"execution_count":null,"outputs":[{"output_type":"stream","text":["The following parameters are configurable for this NLU pipeline (You can copy paste the examples) :\n",">>> pipe['document_assembler'] has settable params:\n","pipe['document_assembler'].setCleanupMode('shrink')                                     | Info: possible values: disabled, inplace, inplace_full, shrink, shrink_full, each, each_full, delete_full | Currently set to : shrink\n",">>> pipe['sentence_detector@SentenceDetectorDLModel_c83c27f46b97'] has settable params:\n","pipe['sentence_detector@SentenceDetectorDLModel_c83c27f46b97'].setExplodeSentences(False)  | Info: whether to explode each sentence into a different row, for better parallelization. Defaults to false. 
| Currently set to : False\n","pipe['sentence_detector@SentenceDetectorDLModel_c83c27f46b97'].setStorageRef('SentenceDetectorDLModel_c83c27f46b97')  | Info: storage unique identifier | Currently set to : SentenceDetectorDLModel_c83c27f46b97\n","pipe['sentence_detector@SentenceDetectorDLModel_c83c27f46b97'].setEncoder(com.johnsnowlabs.nlp.annotators.sentence_detector_dl.SentenceDetectorDLEncoder@11f6140a)  | Info: Data encoder | Currently set to : com.johnsnowlabs.nlp.annotators.sentence_detector_dl.SentenceDetectorDLEncoder@11f6140a\n","pipe['sentence_detector@SentenceDetectorDLModel_c83c27f46b97'].setImpossiblePenultimates(['Bros', 'No', 'al', 'vs', 'etc', 'Fig', 'Dr', 'Prof', 'PhD', 'MD', 'Co', 'Corp', 'Inc', 'bros', 'VS', 'Vs', 'ETC', 'fig', 'dr', 'prof', 'PHD', 'phd', 'md', 'co', 'corp', 'inc', 'Jan', 'Feb', 'Mar', 'Apr', 'Jul', 'Aug', 'Sep', 'Sept', 'Oct', 'Nov', 'Dec', 'St', 'st', 'AM', 'PM', 'am', 'pm', 'e.g', 'f.e', 'i.e'])  | Info: Impossible penultimates | Currently set to : ['Bros', 'No', 'al', 'vs', 'etc', 'Fig', 'Dr', 'Prof', 'PhD', 'MD', 'Co', 'Corp', 'Inc', 'bros', 'VS', 'Vs', 'ETC', 'fig', 'dr', 'prof', 'PHD', 'phd', 'md', 'co', 'corp', 'inc', 'Jan', 'Feb', 'Mar', 'Apr', 'Jul', 'Aug', 'Sep', 'Sept', 'Oct', 'Nov', 'Dec', 'St', 'st', 'AM', 'PM', 'am', 'pm', 'e.g', 'f.e', 'i.e']\n","pipe['sentence_detector@SentenceDetectorDLModel_c83c27f46b97'].setModelArchitecture('cnn')  | Info: Model architecture (CNN) | Currently set to : cnn\n",">>> pipe['bert_sentence@sent_small_bert_L12_768'] has settable params:\n","pipe['bert_sentence@sent_small_bert_L12_768'].setBatchSize(8)                           | Info: Size of every batch | Currently set to : 8\n","pipe['bert_sentence@sent_small_bert_L12_768'].setCaseSensitive(False)                   | Info: whether to ignore case in tokens for embeddings matching | Currently set to : False\n","pipe['bert_sentence@sent_small_bert_L12_768'].setDimension(768)                         | Info: Number of embedding dimensions | Currently set to : 768\n","pipe['bert_sentence@sent_small_bert_L12_768'].setMaxSentenceLength(128)                 | Info: Max sentence length to process | Currently set to : 128\n","pipe['bert_sentence@sent_small_bert_L12_768'].setIsLong(False)                          | Info: Use Long type instead of Int type for inputs buffer - Some Bert models require Long instead of Int. | Currently set to : False\n","pipe['bert_sentence@sent_small_bert_L12_768'].setStorageRef('sent_small_bert_L12_768')  | Info: unique reference name for identification | Currently set to : sent_small_bert_L12_768\n",">>> pipe['classifier_dl@sent_small_bert_L12_768'] has settable params:\n","pipe['classifier_dl@sent_small_bert_L12_768'].setClasses(['average', 'great', 'poor'])  | Info: get the tags used to trained this ClassifierDLModel | Currently set to : ['average', 'great', 'poor']\n","pipe['classifier_dl@sent_small_bert_L12_768'].setStorageRef('sent_small_bert_L12_768')  | Info: unique reference name for identification | Currently set to : sent_small_bert_L12_768\n"],"name":"stdout"}]}]}